In this project we are going through the covid-19 data from the John Hopkins University to build a global analytics dashboard. This project is divided in 4 parts:
We start our project by loading the needed packages and data.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import dash
from dash import dcc
from dash import html
from dash.dependencies import Input, Output
import dash_bootstrap_components as dbc
from datetime import datetime, timedelta
import pycountry
covid_df = pd.read_csv('data/covid_tidy_data.csv')
covid_df.tail()
| Country | Date | Cases_Confirmed | New_Cases_Confirmed | Cases_Death | New_Cases_Death | Cases_Recovered | New_Cases_Recovered | Code | |
|---|---|---|---|---|---|---|---|---|---|
| 225166 | Zimbabwe | 2023-03-05 | 264127 | 0 | 5668 | 0 | 0 | 0 | ZWE |
| 225167 | Zimbabwe | 2023-03-06 | 264127 | 0 | 5668 | 0 | 0 | 0 | ZWE |
| 225168 | Zimbabwe | 2023-03-07 | 264127 | 0 | 5668 | 0 | 0 | 0 | ZWE |
| 225169 | Zimbabwe | 2023-03-08 | 264276 | 149 | 5671 | 3 | 0 | 0 | ZWE |
| 225170 | Zimbabwe | 2023-03-09 | 264276 | 0 | 5671 | 0 | 0 | 0 | ZWE |
pop_df = pd.read_csv('data/world_population.csv')
pop_df.tail()
| Rank | CCA3 | Country/Territory | Capital | Continent | 2022 Population | 2020 Population | 2015 Population | 2010 Population | 2000 Population | 1990 Population | 1980 Population | 1970 Population | Area (km²) | Density (per km²) | Growth Rate | World Population Percentage | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 229 | 226 | WLF | Wallis and Futuna | Mata-Utu | Oceania | 11572 | 11655 | 12182 | 13142 | 14723 | 13454 | 11315 | 9377 | 142 | 81.4930 | 0.9953 | 0.00 |
| 230 | 172 | ESH | Western Sahara | El Aaiún | Africa | 575986 | 556048 | 491824 | 413296 | 270375 | 178529 | 116775 | 76371 | 266000 | 2.1654 | 1.0184 | 0.01 |
| 231 | 46 | YEM | Yemen | Sanaa | Asia | 33696614 | 32284046 | 28516545 | 24743946 | 18628700 | 13375121 | 9204938 | 6843607 | 527968 | 63.8232 | 1.0217 | 0.42 |
| 232 | 63 | ZMB | Zambia | Lusaka | Africa | 20017675 | 18927715 | 16248230 | 13792086 | 9891136 | 7686401 | 5720438 | 4281671 | 752612 | 26.5976 | 1.0280 | 0.25 |
| 233 | 74 | ZWE | Zimbabwe | Harare | Africa | 16320537 | 15669666 | 14154937 | 12839771 | 11834676 | 10113893 | 7049926 | 5202918 | 390757 | 41.7665 | 1.0204 | 0.20 |
The covid_df already contains a tidy dataset and we went to this process in another project that can be seen here. Now let us merge the dataframes. We will only be using the 2020 population, since it was the year of the pandemic start and its peak.
covid_df = covid_df.merge(pop_df[['CCA3', '2020 Population']], left_on='Code', right_on='CCA3')
covid_df.tail()
| Country | Date | Cases_Confirmed | New_Cases_Confirmed | Cases_Death | New_Cases_Death | Cases_Recovered | New_Cases_Recovered | Code | CCA3 | 2020 Population | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 222880 | Zimbabwe | 2023-03-05 | 264127 | 0 | 5668 | 0 | 0 | 0 | ZWE | ZWE | 15669666 |
| 222881 | Zimbabwe | 2023-03-06 | 264127 | 0 | 5668 | 0 | 0 | 0 | ZWE | ZWE | 15669666 |
| 222882 | Zimbabwe | 2023-03-07 | 264127 | 0 | 5668 | 0 | 0 | 0 | ZWE | ZWE | 15669666 |
| 222883 | Zimbabwe | 2023-03-08 | 264276 | 149 | 5671 | 3 | 0 | 0 | ZWE | ZWE | 15669666 |
| 222884 | Zimbabwe | 2023-03-09 | 264276 | 0 | 5671 | 0 | 0 | 0 | ZWE | ZWE | 15669666 |
We ended up with less rows than before because some countries lack of entries in the countries population file. We can't get the incidence, mortality and lethality rates since those will have to be calculated according to the appied filters in the dashboard. To finish this first part, let us change the Date column type, get rid of the CCA3 column and alter the name of the population column.
covid_df['Date'] = pd.to_datetime(covid_df['Date'])
covid_df = covid_df.drop('CCA3', axis=1)
covid_df = covid_df.rename(columns={'2020 Population':'Population'})
covid_df.tail()
| Country | Date | Cases_Confirmed | New_Cases_Confirmed | Cases_Death | New_Cases_Death | Cases_Recovered | New_Cases_Recovered | Code | Population | |
|---|---|---|---|---|---|---|---|---|---|---|
| 222880 | Zimbabwe | 2023-03-05 | 264127 | 0 | 5668 | 0 | 0 | 0 | ZWE | 15669666 |
| 222881 | Zimbabwe | 2023-03-06 | 264127 | 0 | 5668 | 0 | 0 | 0 | ZWE | 15669666 |
| 222882 | Zimbabwe | 2023-03-07 | 264127 | 0 | 5668 | 0 | 0 | 0 | ZWE | 15669666 |
| 222883 | Zimbabwe | 2023-03-08 | 264276 | 149 | 5671 | 3 | 0 | 0 | ZWE | 15669666 |
| 222884 | Zimbabwe | 2023-03-09 | 264276 | 0 | 5671 | 0 | 0 | 0 | ZWE | 15669666 |
covid_df.to_csv('data/final_dataset.csv', index=False)
In this part of the project, we are going to generate the views that will be later used in the dashboard. We are going to write several functions in this part, so they can be reused later. Let us start by getting some KPIs with general information.
def get_cases(df):
return df['New_Cases_Confirmed'].sum()
def get_deaths(df):
return df['New_Cases_Death'].sum()
def get_incidence(df):
return round(get_cases(df)/df.groupby('Country')['Population'].max().sum(), 3)
def get_mortality(df):
return round(get_deaths(df)/df.groupby('Country')['Population'].max().sum(), 3)
def get_lethality(df):
return round(get_deaths(df)/get_cases(df), 3)
def get_kpi(indicator, title):
fig = go.Figure(go.Indicator(
mode = "number",
value = indicator,
number = {'font':{'size':36}, 'font_color':'black'},
title = {'text': title, 'font_size':16, 'font_color':'black'},
domain = {'x': [0, 1], 'y': [0, 1]}
),layout= go.Layout(height=50))
return fig
fig = get_kpi(get_cases(covid_df), 'Cases')
fig.show()
fig = get_kpi(get_deaths(covid_df.query("Country == 'Brazil'")), 'Deaths in Brazil')
fig.show()
fig = get_kpi(get_incidence(covid_df.query("Country == 'US'")), 'Incidence')
fig.show()
fig = get_kpi(get_mortality(covid_df.query("Country == 'San Marino'")), 'Mortality')
fig.show()
fig = get_kpi(get_lethality(covid_df.query("Date < '2021-01-01'")), 'Lethality in 2020')
fig.show()
Now let us get bubble map with tha case as the bubble size and the deaths as its color. We are also changing the numbers format in order to make them more readable.
def cases_deaths_by_country(df):
cases_suffix = ''
deaths_suffix = ''
temp_df = df.groupby(['Code', 'Country'])[['New_Cases_Confirmed', 'New_Cases_Death']].sum().reset_index()
if temp_df['New_Cases_Confirmed'].max()>10**6:
cases_suffix = 'M'
temp_df['New_Cases_Confirmed'] = temp_df['New_Cases_Confirmed']/10**6
elif temp_df['New_Cases_Confirmed'].max()>10**3:
cases_suffix = 'K'
temp_df['New_Cases_Confirmed'] = temp_df['New_Cases_Confirmed']/10**3
if temp_df['New_Cases_Death'].max()>10**6:
deaths_suffix = 'M'
temp_df['New_Cases_Death'] = temp_df['New_Cases_Death']/10**6
elif temp_df['New_Cases_Confirmed'].max()>10**3:
deaths_suffix = 'K'
temp_df['New_Cases_Death'] = temp_df['New_Cases_Death']/10**3
return [temp_df, cases_suffix, deaths_suffix]
def get_bubble_map(df_suffixex, title):
df = df_suffixex[0]
cases_suffix = df_suffixex[1]
deaths_suffix = df_suffixex[2]
fig = px.scatter_geo(df, locations="Code", color="New_Cases_Confirmed",
hover_name="Country",
hover_data={"Code":False,
"New_Cases_Confirmed":":.2f",
"New_Cases_Death":":.2f"},
size="New_Cases_Death",
color_continuous_scale=px.colors.sequential.YlOrRd,
title = title,
size_max = 40,
labels = {"New_Cases_Confirmed":"Cases ({})".format(cases_suffix),
"New_Cases_Death":"Deaths ({})".format(deaths_suffix)},
projection="natural earth")
return fig
get_bubble_map(cases_deaths_by_country(covid_df), "Cases and Deaths Around the World")
Now, to finish this part, let us get some line charts to get the pandemic evolution over time.
def get_cases_day(df):
return df.groupby('Date')['New_Cases_Confirmed'].sum().reset_index()
def get_deaths_day(df):
return df.groupby('Date')['New_Cases_Death'].sum().reset_index()
def get_line_chart(df, title, column):
fig = px.line(df, x='Date', y=column, labels={column:title}, title=title+" by Day")
return fig
get_line_chart(get_cases_day(covid_df), 'Cases', 'New_Cases_Confirmed')
get_line_chart(get_deaths_day(covid_df.query('Country=="Brazil"')), 'Deaths', 'New_Cases_Death')
Our dashboard will have two filters. One range slider selector for the date interval and one dropdown selector for the country. Since the Dash's range slide does not work with dates, we are going to create a dictionary to enable displaying the dates.
date_dict = {}
for i,date in enumerate(covid_df['Date'].unique()):
date_dict[i]=str(pd.to_datetime(date).date().strftime('%m/%d/%Y'))
list(date_dict.items())[-6:-1]
[(1137, '03/04/2023'), (1138, '03/05/2023'), (1139, '03/06/2023'), (1140, '03/07/2023'), (1141, '03/08/2023')]
Now we are going to create the countries filter options. This filter must include an "All" button, in order to allow us select all the countries.
countries = ['All']
countries.extend(list(covid_df['Country'].unique()))
countries[0:5]
['All', 'Afghanistan', 'Albania', 'Algeria', 'Andorra']
Now we are ready to generate our dashboard.
app = dash.Dash(external_stylesheets=[dbc.themes.CERULEAN])
app.layout = dbc.Container([
dbc.Row(dbc.Col(html.H1("Covid-19 Analytics Dashboard"), md=10), align="start", justify="evenly", style={"height": 100}),
dbc.Row([
dbc.Col(html.Div("Select the country:"), md=2),
dbc.Col(html.Div("Select the period:"), md=8)
], align="start", justify="evenly"),
dbc.Row([
dbc.Col(dcc.Dropdown(id='my-dropdown',
options=countries,
multi=False,
clearable=False,
value='All',
), md=2),
dbc.Col(dcc.RangeSlider(min=0,
max=1142,
value=[0, 1142],
marks={0:date_dict[0], 366:date_dict[366], 366+365:date_dict[366+365], 1142:date_dict[1142] },
id='my-slider'), md=8)
], align="start", justify="evenly"),
dbc.Row([
dbc.Col(dcc.Graph(id='cases_kpi', figure={}, responsive=True, style={"height": "100%"}), md=2, style={"height": "100%"}),
dbc.Col(dcc.Graph(id='deaths_kpi', figure={}, responsive=True, style={"height": "100%"}), md=2, style={"height": "100%"}),
dbc.Col(dcc.Graph(id='incidence_kpi', figure={}, responsive=True, style={"height": "100%"}), md=2, style={"height": "100%"}),
dbc.Col(dcc.Graph(id='mortality_kpi', figure={}, responsive=True, style={"height": "100%"}), md=2, style={"height": "100%"}),
dbc.Col(dcc.Graph(id='lethality_kpi', figure={}, responsive=True, style={"height": "100%"}), md=2, style={"height": "100%"})
], align="start", justify="evenly", style={"height": 100}),
dbc.Row([
dbc.Col(dcc.Graph(id='map', figure={}, responsive=True), md=10)
], align="start", justify="evenly"),
dbc.Row([
dbc.Col(dcc.Graph(id='cases_line', figure={}, responsive=True), md=10)
], align="start", justify="evenly"),
dbc.Row([
dbc.Col(dcc.Graph(id='deaths_line', figure={}, responsive=True), md=10)
], align="start", justify="evenly")
], fluid=True)
@app.callback(
Output('cases_kpi', 'figure'),
Output('deaths_kpi', 'figure'),
Output('incidence_kpi', 'figure'),
Output('mortality_kpi', 'figure'),
Output('lethality_kpi', 'figure'),
Output('map', 'figure'),
Output('cases_line', 'figure'),
Output('deaths_line', 'figure'),
Input('my-slider', 'value'),
Input('my-dropdown', 'value')
)
def update_output(period, country):
dt_ini=pd.to_datetime(date_dict[period[0]])
dt_fin=pd.to_datetime(date_dict[period[1]])
query = '(Date>=@dt_ini)&(Date<=@dt_fin)'
if country != 'All':
query = query + '&(Country==@country)'
fig_cases = get_kpi(get_cases(covid_df.query(query)), 'Cases')
fig_deaths = get_kpi(get_deaths(covid_df.query(query)), 'Deaths')
fig_incidence = get_kpi(get_incidence(covid_df.query(query)), 'Incidence')
fig_mortality = get_kpi(get_mortality(covid_df.query(query)), 'Mortality')
fig_lethality = get_kpi(get_lethality(covid_df.query(query)), 'Lethality')
fig_map = get_bubble_map(cases_deaths_by_country(covid_df.query(query)), 'Cases and Deaths by Country')
fig_cases_line = get_line_chart(get_cases_day(covid_df.query(query)), 'Cases', 'New_Cases_Confirmed')
fig_deaths_line = get_line_chart(get_deaths_day(covid_df.query(query)), 'Deaths', 'New_Cases_Death')
return fig_cases, fig_deaths, fig_incidence, fig_mortality, fig_lethality, fig_map, fig_cases_line, fig_deaths_line
app.run(debug=True, port=8051)
In this project, we have created an interactive analytics dashboard with covid-19 around the world data using the Dash package. Even though the package doesn't offer a lot of options and it is not as simple to use as some dataviz tools such as Tableau and Power BI, it pays off for it being free and very flexible.